This is my personal project page for the University of Helsinki statistics course named “Introduction to Open Data Science”.
My name is Pinja-Liina Jalkanen.
Please be forewarned that my statistics skills are rather crappy and usually limited to some simple descriptives and regressions. I do know some tech stuff quite well, though, so I don't expect great difficulties with the R syntax, Git or the like. Kimmo Vehkalahti said in the first course lesson that data analysis is often 90% preparation of the data and 10% doing the analysis. I don't know about the weighting on this course, but I usually don't need help with the 90% at all; that's mostly RTFM, STFW and the like anyway.
With regard to the 10% though, I don’t trust my skills at all.
Link to this project: https://github.com/pinjaliina/IODS-project
The data of the analysis in this exercise was the Autumn 2014 Introductory Statistics Course Learning Questionnaire. The data has 60 variables and 183 observations; except for the few background-related variables (age, attitude towards statistics, course points), most of the questions were learning-related, with answers given on a Likert scale from 1 to 5.
For the analysis, I’ve averaged the variables related to deep learning, surface learning, and strategic learning. (For the initial data wrangling part of this exercise, see this R script).
The data is read in as follows:
# Read in the data
lrn2014a <- as.data.frame(read.table('data/learning2014.csv', sep="\t", header=TRUE))
A scatter plot matrix of the variables of the data can be drawn as follows, coloured by the gender variable:
p <- ggpairs(lrn2014a, mapping = aes(col = gender), lower = list(combo = wrap("facethist", bins = 20)))
p
A summary of each of the variables of the data can be displayed as follows:
summary(lrn2014a)
## age attitude points gender
## Min. :17.00 Min. :14.00 Min. : 7.00 F:110
## 1st Qu.:21.00 1st Qu.:26.00 1st Qu.:19.00 M: 56
## Median :22.00 Median :32.00 Median :23.00
## Mean :25.51 Mean :31.43 Mean :22.72
## 3rd Qu.:27.00 3rd Qu.:37.00 3rd Qu.:27.75
## Max. :55.00 Max. :50.00 Max. :33.00
## deep stra surf
## Min. :1.625 Min. :1.583 Min. :1.250
## 1st Qu.:3.500 1st Qu.:2.417 1st Qu.:2.625
## Median :3.875 Median :2.833 Median :3.188
## Mean :3.796 Mean :2.787 Mean :3.121
## 3rd Qu.:4.250 3rd Qu.:3.167 3rd Qu.:3.625
## Max. :4.875 Max. :4.333 Max. :5.000
As can be seen from the output, most of the variables – with the exception of the students' age – show no obvious pattern and correlate only weakly with the course points. The most notable exception is the attitude towards statistics, which correlates more strongly with the course points than any other variable. Except for age and points, the distribution of most of the variables seems to be reasonably close to a normal distribution. The gender variable shows that significantly more female than male students attended the course.
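For reference, the correlations shown in the scatter plot matrix are ordinary Pearson correlations, which `cor()` computes directly. A minimal self-contained sketch with toy data (not the course data):

```r
# Toy data illustrating cor(): an exact linear relationship gives r = 1,
# while noisier data gives a weaker (but still positive) correlation.
x <- c(1, 2, 3, 4, 5)
y_linear <- 2 * x + 1          # exact linear dependence
y_noisy  <- c(3, 4, 8, 7, 12)  # roughly increasing, but with noise

cor(x, y_linear) # exactly 1
cor(x, y_noisy)  # positive, but below 1
```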
A linear regression model can be fit to the data as follows:
lrn2014_model <- lm(points ~ attitude + stra + surf, data = lrn2014a)
The chosen explanatory variables for the model were attitude, strategic learning and surface learning. The summary of the model can be printed as follows:
summary(lrn2014_model)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = lrn2014a)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.01711 3.68375 2.991 0.00322 **
## attitude 0.33952 0.05741 5.913 1.93e-08 ***
## stra -0.58607 0.80138 -0.731 0.46563
## surf 0.85313 0.54159 1.575 0.11716
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
However, as can be seen from the model summary, the estimates of strategic and surface learning have no statistical significance in explaining the course points; given the weak correlation of those variables with the points, this was somewhat expected. It thus makes sense to eliminate at least the variable with the highest p-value, strategic learning:
lrn2014_model <- lm(points ~ attitude + surf, data = lrn2014a)
summary(lrn2014_model)
##
## Call:
## lm(formula = points ~ attitude + surf, data = lrn2014a)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.97290 2.39591 3.745 0.00025 ***
## attitude 0.34658 0.05652 6.132 6.31e-09 ***
## surf 0.91365 0.53447 1.709 0.08927 .
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
With that variable removed, the statistical significance of the surface learning estimate improves to a near-significant level, so it can be left in the model. In practice its effect on the model is so tiny that it could nearly as well be left out, but because it is reasonably close to statistical significance and also yields the highest adjusted R² value – the adjusted R² of a model with attitude as the only explanatory variable, as reported by summary(), would be 0.1856 – its inclusion in the model is justified.
While the model estimates are statistically significant, the model as a whole is not very good: the multiple R² value is only 0.20, which means that about 80% of the variance of the dependent variable – course points – remains unexplained. Thus, any predictions based on the model alone might not be very reliable.
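What the multiple R² measures can be verified by hand: it is one minus the ratio of the residual sum of squares to the total sum of squares. A small self-contained sketch with simulated data (not the course data):

```r
# Verify that R^2 = 1 - SS_res / SS_tot matches what summary.lm() reports.
set.seed(1)
x <- 1:50
y <- 2 * x + rnorm(50, sd = 10)  # simulated linear data with noise
fit <- lm(y ~ x)

ss_res <- sum(residuals(fit)^2)  # residual sum of squares
ss_tot <- sum((y - mean(y))^2)   # total sum of squares
r2_manual <- 1 - ss_res / ss_tot

all.equal(r2_manual, summary(fit)$r.squared) # TRUE
```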
In addition to linearity, linear models are fitted with the assumption that the errors are normally distributed, have a constant variance, and are not correlated with each other.
The validity of these assumptions can be tested by analysing the residuals of the model, with the help of different kinds of diagnostic plots. In the following figure, three different plots are drawn:
par(mfrow = c(1,3)) # Set some graphical params.
# It seems that the drawing order of the plot is independent of the vector
# order and can't be changed.
plot(lrn2014_model, which = c(1,2,5))
Interpretation of the plots: the Residuals vs Fitted plot is used to check the constant variance assumption, the Normal Q-Q plot the normality of the errors, and the Residuals vs Leverage plot to spot observations with an unusually large influence on the model.
The data of the analysis in this exercise was Fabio Pagnotta's and Hossain Mohammad Amran's Using Data Mining To Predict Secondary School Student Alcohol Consumption (2008), published by the Department of Computer Science of the University of Camerino (link: https://archive.ics.uci.edu/ml/datasets/STUDENT+ALCOHOL+CONSUMPTION). For the analysis, I've also created a combined total alcohol consumption variable (the sum of weekday and weekend consumption) and a separate logical high use variable based on the total consumption. (For the initial data wrangling part of this exercise, see this R script.)
The data is read in as follows:
# Read in the data
alc <- as.data.frame(read.table('data/alc.csv', sep="\t", header=TRUE))
A glimpse of all of the variables of the data can be displayed as follows:
glimpse(alc)
## Observations: 382
## Variables: 35
## $ school <fctr> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, G...
## $ sex <fctr> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, ...
## $ age <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15...
## $ address <fctr> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, ...
## $ famsize <fctr> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, ...
## $ Pstatus <fctr> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, ...
## $ Medu <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4...
## $ Fedu <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4...
## $ Mjob <fctr> at_home, at_home, at_home, health, other, ser...
## $ Fjob <fctr> teacher, other, other, services, other, other...
## $ reason <fctr> course, course, other, home, home, reputation...
## $ nursery <fctr> yes, no, yes, yes, yes, yes, yes, yes, yes, y...
## $ internet <fctr> no, yes, yes, yes, no, yes, yes, no, yes, yes...
## $ guardian <fctr> mother, father, mother, mother, father, mothe...
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1...
## $ studytime <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1...
## $ failures <int> 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ schoolsup <fctr> yes, no, yes, no, no, no, no, yes, no, no, no...
## $ famsup <fctr> no, yes, no, yes, yes, yes, no, yes, yes, yes...
## $ paid <fctr> no, no, yes, yes, yes, yes, no, no, yes, yes,...
## $ activities <fctr> no, no, no, yes, no, yes, no, no, no, yes, no...
## $ higher <fctr> yes, yes, yes, yes, yes, yes, yes, yes, yes, ...
## $ romantic <fctr> no, no, no, yes, no, no, no, no, no, no, no, ...
## $ famrel <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4...
## $ freetime <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4...
## $ goout <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4...
## $ Dalc <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ Walc <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2...
## $ health <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2...
## $ absences <int> 6, 4, 10, 2, 4, 10, 0, 6, 0, 0, 0, 4, 2, 2, 0,...
## $ G1 <int> 5, 5, 7, 15, 6, 15, 12, 6, 16, 14, 10, 10, 14,...
## $ G2 <int> 6, 5, 8, 14, 10, 15, 12, 5, 18, 15, 8, 12, 14,...
## $ G3 <int> 6, 6, 10, 15, 10, 15, 11, 6, 19, 15, 9, 12, 14...
## $ alc_use <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1...
## $ high_use <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE...
As can be seen from the above variable list, there are both numerical and factorial background variables. The target variable for this analysis is the binary high/low alcohol consumption variable. To analyse it, I've chosen the following four variables that I assume to be indicative of students' alcohol consumption: absences, going out with friends (goout), the final grade (G3) and weekly study time (studytime).
A summary and some plots of the chosen variables are shown below (the boxplots' whiskers extend to at most 1.5 times the interquartile range). I also grouped the box plots by sex to see any potential differences between the sexes:
summary(alc[c('absences','goout','G3','studytime')])
## absences goout G3 studytime
## Min. : 0.000 Min. :1.000 Min. : 0.00 Min. :1.000
## 1st Qu.: 0.000 1st Qu.:2.000 1st Qu.: 8.00 1st Qu.:1.000
## Median : 3.000 Median :3.000 Median :11.00 Median :2.000
## Mean : 5.319 Mean :3.113 Mean :10.39 Mean :2.034
## 3rd Qu.: 8.000 3rd Qu.:4.000 3rd Qu.:14.00 3rd Qu.:2.000
## Max. :75.000 Max. :5.000 Max. :20.00 Max. :4.000
p1 <- ggplot(alc, aes(x = high_use, y = absences, col=sex)) + geom_boxplot() + xlab('high use')
p2 <- ggplot(alc, aes(goout)) + geom_bar(aes(fill = high_use), position = "dodge", stat="count") + xlab('going out')
p3 <- ggplot(alc, aes(x = high_use, y = G3, col=sex)) + geom_boxplot() + ylab('final grade') + xlab('high use')
p4 <- ggplot(alc, aes(studytime)) + geom_bar(aes(fill = high_use), position = "dodge", stat="count")
# The multiplot() is defined in the init section of this file. For details, see
# http://www.cookbook-r.com/Graphs/Multiple_graphs_on_one_page_(ggplot2)/
multiplot(p1, p2, p3, p4, cols = 2)
Based on the above exploration of the variables, it looks like all my previously stated hypotheses held to at least some extent, with perhaps the exception of the final grade. To confirm this, I did a logistic regression analysis using the chosen variables as the explanatory variables.
In the following, a model is fitted to the data and summarised.
m <- glm(high_use ~ absences + goout + G3 + studytime, data = alc, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ absences + goout + G3 + studytime, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8166 -0.7495 -0.5034 0.8071 2.5083
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.62195 0.62727 -4.180 2.92e-05 ***
## absences 0.05600 0.01643 3.407 0.000656 ***
## goout 0.74536 0.12013 6.204 5.49e-10 ***
## G3 0.01269 0.02848 0.446 0.655854
## studytime -0.60004 0.16852 -3.561 0.000370 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 462.21 on 381 degrees of freedom
## Residual deviance: 383.16 on 377 degrees of freedom
## AIC: 393.16
##
## Number of Fisher Scoring iterations: 4
By the model summary, it looks like my hypothesis was wrong with regard to the final grade: it wasn't a good predictor of high alcohol consumption at all, bearing no statistical significance whatsoever. All the other chosen explanatory variables are relatively strong predictors. Calculating the odds ratios points in the same direction:
or <- exp(coef(m))
or
## (Intercept) absences goout G3 studytime
## 0.07266094 1.05759534 2.10719756 1.01277150 0.54879019
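The exponentiation step can be checked by hand: a logistic regression coefficient is a change in log-odds, so exponentiating it gives the multiplicative change in the odds. A sketch using the goout coefficient printed in the model summary above:

```r
# The goout coefficient from the model summary above (on the log-odds scale).
b_goout <- 0.74536
# One extra unit of "going out" multiplies the odds of high use by exp(b):
exp(b_goout) # about 2.11, matching the odds ratio printed above
```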
As the odds ratios show, a student is virtually equally likely to consume a lot of alcohol regardless of the grade. Interestingly, absences don't matter much either: each additional absence multiplies the odds of high use by only about 1.06. With regard to outgoingness and study time the results are, however, very clear: a student who goes out a lot is about twice as likely to also drink a lot, and a student who studies a lot is only about half as likely to drink a lot. But while absences don't increase the likelihood of high use much, comparing their odds ratio to its confidence interval still confirms their statistical significance:
ci <- exp(confint(m))
cbind(or, ci)
## or 2.5 % 97.5 %
## (Intercept) 0.07266094 0.02042166 0.2402599
## absences 1.05759534 1.02602413 1.0954317
## goout 2.10719756 1.67554867 2.6863511
## G3 1.01277150 0.95846241 1.0720167
## studytime 0.54879019 0.38995929 0.7564410
As shown above, the confidence intervals of the odds ratios of all the explanatory variables except the final grade (G3) steer well clear of one, which means it is unlikely that their true odds ratios are actually one (i.e. that they have no effect). With regard to the final grade, however, my initial hypothesis was clearly wrong, because the value one lies almost in the middle of its confidence interval. Thus, before making any actual predictions with the model, it's best to refit it without that variable:
m <- glm(high_use ~ absences + goout + studytime, data = alc, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ absences + goout + studytime, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8376 -0.7530 -0.4927 0.8091 2.4967
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.47603 0.53168 -4.657 3.21e-06 ***
## absences 0.05627 0.01651 3.409 0.000651 ***
## goout 0.73711 0.11835 6.228 4.72e-10 ***
## studytime -0.59422 0.16800 -3.537 0.000405 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 462.21 on 381 degrees of freedom
## Residual deviance: 383.36 on 378 degrees of freedom
## AIC: 391.36
##
## Number of Fisher Scoring iterations: 4
cbind(exp(coef(m)), exp(confint(m)))
## 2.5 % 97.5 %
## (Intercept) 0.08407616 0.02874986 0.2322992
## absences 1.05788483 1.02622976 1.0958710
## goout 2.08988646 1.66701101 2.6539672
## studytime 0.55199236 0.39259400 0.7600378
As shown by the above statistics, the remaining explanatory variables are now all statistically highly significant and have confidence intervals well clear of the value one, so the model is now ready to be used for predictions.
The model can be used for predictions as follows:
# Predict the probability.
probabilities <- predict(m, type = "response")
# Add the probabilities to alc.
alc <- mutate(alc, probability = probabilities)
# Calculate a logical high use value based on probabilities.
alc <- mutate(alc, prediction = probability > 0.5)
# Tabulate the target variable versus the predictions,
# with both absolute and proportional numbers.
tbl <- table(high_use = alc$high_use, prediction = alc$prediction)
addmargins(tbl)
## prediction
## high_use FALSE TRUE Sum
## FALSE 246 24 270
## TRUE 66 46 112
## Sum 312 70 382
round(addmargins(prop.table(tbl)), 2)
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.64 0.06 0.71
## TRUE 0.17 0.12 0.29
## Sum 0.82 0.18 1.00
As the tables show, the model is too cautious in its predictions; it predicts fewer occurrences of high use than the actual data shows. This can also be demonstrated graphically:
hu <- as.data.frame(prop.table(table(alc$high_use)))
pred <- as.data.frame(prop.table(table(alc$prediction)))
pp1 <- ggplot(hu, aes(Var1, Freq)) + geom_col(aes(fill = Var1)) + scale_y_continuous(limits = 0:1) + ylab('frequency') + xlab('observed high use') + theme(legend.position = 'none')
pp2 <- ggplot(pred, aes(Var1, Freq)) + geom_col(aes(fill = Var1)) + scale_y_continuous(limits = 0:1) + ylab('frequency') + xlab('predicted high use') + theme(legend.position = 'none')
multiplot(pp1, pp2, cols = 2)
The actual model training error can be calculated as follows (this is defined as a function only because one is needed later for cv.glm()):
mloss <- function(obs, prob) {
res <- ifelse(prob > 0.5, 1, 0)
mean(res != obs)
}
round(mloss(obs = alc$high_use, prob = alc$probability), 2)
## [1] 0.24
The training error is 24%, so the model has a little over 75% accuracy. This isn't perfect, but likely still better than a simple guessing strategy, given that by guessing alone I wasn't able to predict my chosen variables' statistical significance correctly.
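The same figure can be recovered directly from the confusion table above: the training error is the share of misclassified observations, i.e. the off-diagonal cells divided by the total. A quick check using the counts printed above:

```r
# Counts from the cross-tabulation above.
fp <- 24  # predicted high use, observed low use
fn <- 66  # predicted low use, observed high use
n  <- 382 # total number of observations
(fp + fn) / n # about 0.236, i.e. the ~24% training error
```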
To test the model further, cross-validation can be performed. The following performs a 10-fold cross-validation:
cv <- cv.glm(data = alc, cost = mloss, glmfit = m, K = 10)
cv$delta[1] # Print the average number of wrong predictions.
## [1] 0.2329843
The average prediction error of the 10-fold cross-validation is 0.23, which is already better than the performance of the model introduced in the DataCamp exercise. The model could, however, be improved further by adding more variables. In particular, it does not include sex, which according to the DataCamp exercise is a statistically significant variable. Thus, we can try adding it and re-run the cross-validation:
m2 <- glm(high_use ~ absences + goout + studytime + sex, data = alc, family = "binomial")
summary(m2) # Summarise the model.
##
## Call:
## glm(formula = high_use ~ absences + goout + studytime + sex,
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7586 -0.7653 -0.4897 0.7254 2.5996
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.23486 0.59934 -5.397 6.76e-08 ***
## absences 0.06269 0.01687 3.716 0.000202 ***
## goout 0.74151 0.12061 6.148 7.86e-10 ***
## studytime -0.45081 0.17155 -2.628 0.008594 **
## sexM 0.81167 0.26937 3.013 0.002585 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 462.21 on 381 degrees of freedom
## Residual deviance: 374.08 on 377 degrees of freedom
## AIC: 384.08
##
## Number of Fisher Scoring iterations: 4
cbind(exp(coef(m2)), exp(confint(m2))) # Count odds and their confidence intervals.
## 2.5 % 97.5 %
## (Intercept) 0.0393658 0.01172067 0.1236057
## absences 1.0646959 1.03217181 1.1038908
## goout 2.0990949 1.66738691 2.6783765
## studytime 0.6371102 0.45038109 0.8845879
## sexM 2.2516753 1.33375527 3.8439285
# Predict the probability.
probabilities2 <- predict(m2, type = "response")
# Add the probabilities to alc.
alc <- mutate(alc, probability2 = probabilities2)
# Calculate a logical high use value based on probabilities.
alc <- mutate(alc, prediction2 = probability2 > 0.5)
# Re-run cross-validation and print the average number of wrong predictions.
cv2 <- cv.glm(data = alc, cost = mloss, glmfit = m2, K = 10)
cv2$delta[1]
## [1] 0.2120419
The average prediction error is now only 0.21, thus clearly lower than in the previous model.
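For intuition, a single prediction can be computed by hand from the coefficients of the final model: the linear predictor is passed through the logistic function p = 1 / (1 + exp(-η)). A sketch using the coefficient estimates printed above, for a hypothetical student whose covariate values are made up purely for illustration:

```r
# Coefficients of the final model (copied from the summary above).
b <- c(intercept = -3.23486, absences = 0.06269, goout = 0.74151,
       studytime = -0.45081, sexM = 0.81167)
# A hypothetical male student: 4 absences, goes out a lot (4), studies 2.
x <- c(1, 4, 4, 2, 1)
eta <- sum(b * x)        # linear predictor (log-odds)
p <- 1 / (1 + exp(-eta)) # logistic function
p # about 0.47, i.e. just below the 0.5 high-use prediction threshold
```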
The data of this classification and clustering exercise was the Housing Values in Suburbs of Boston dataset, henceforth referred to simply as Boston. It is available from the R package MASS, which includes functions and datasets to support Venables and Ripley, "Modern Applied Statistics with S" (4th edition, 2002). According to the reference manual of the package, the dataset includes 506 observations of the following 14 variables:
According to the reference manual, the data is based on the following sources:
After the MASS package has been loaded – by calling library(MASS) – its built-in datasets can be made available simply by calling data() with their names. They are, however, lazy-loaded, and thus only become actual data frames when accessed for the first time (see help("data") for details):
# Prepare the data. (Quotes are optional but recommended; see help("data").)
data("Boston")
Glimpse the data to confirm it matches the reference manual:
# The data is only turned into an actual data frame at this point.
glimpse(Boston)
## Observations: 506
## Variables: 14
## $ crim <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02...
## $ zn <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, ...
## $ indus <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7...
## $ chas <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ nox <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, ...
## $ rm <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, ...
## $ age <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 1...
## $ dis <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5...
## $ rad <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4...
## $ tax <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 1...
## $ black <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 3...
## $ lstat <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15,...
## $ medv <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 1...
Explore the data graphically:
# Subplot axis labels are partially too cramped, but I failed to find a working solution for that.
p <- ggpairs(Boston, mapping = aes(), lower = list(combo = wrap("facethist", bins = 10)), upper = list(continuous = wrap("cor", size=3)))
p
Show variable summaries:
summary(Boston)
## crim zn indus
## Min. : 0.00632 Min. : 0.00 Min. : 0.46
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19
## Median : 0.25651 Median : 0.00 Median : 9.69
## Mean : 3.61352 Mean : 11.36 Mean :11.14
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10
## Max. :88.97620 Max. :100.00 Max. :27.74
## chas nox rm age
## Min. :0.00000 Min. :0.3850 Min. :3.561 Min. : 2.90
## 1st Qu.:0.00000 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02
## Median :0.00000 Median :0.5380 Median :6.208 Median : 77.50
## Mean :0.06917 Mean :0.5547 Mean :6.285 Mean : 68.57
## 3rd Qu.:0.00000 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08
## Max. :1.00000 Max. :0.8710 Max. :8.780 Max. :100.00
## dis rad tax ptratio
## Min. : 1.130 Min. : 1.000 Min. :187.0 Min. :12.60
## 1st Qu.: 2.100 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40
## Median : 3.207 Median : 5.000 Median :330.0 Median :19.05
## Mean : 3.795 Mean : 9.549 Mean :408.2 Mean :18.46
## 3rd Qu.: 5.188 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20
## Max. :12.127 Max. :24.000 Max. :711.0 Max. :22.00
## black lstat medv
## Min. : 0.32 Min. : 1.73 Min. : 5.00
## 1st Qu.:375.38 1st Qu.: 6.95 1st Qu.:17.02
## Median :391.44 Median :11.36 Median :21.20
## Mean :356.67 Mean :12.65 Mean :22.53
## 3rd Qu.:396.23 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :396.90 Max. :37.97 Max. :50.00
As can be seen from the output, almost all of the variables are continuous, with the exception of the Charles River tract variable (chas), which is binary, and the radial-highway accessibility variable (rad), which is an index but still measured on an interval level.
The distribution of most variables is rather skewed, except for the number of rooms per dwelling (rm), which is normally distributed, and the median value of owner-occupied homes (medv), which is nearly normally distributed. Some variables are very strongly skewed, such as the crime rate (crim), the proportion of land zoned for very large lots (zn) and the proportion of black people (black).
There are a lot of variables in the data, so it might be helpful to calculate a correlation matrix of the data and visualise it:
# calculate the correlation matrix and round it
cor_matrix<-cor(Boston) %>% round(digits = 2)
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type="upper", tl.pos="d", tl.cex=0.6)
A numerical equivalent of a correlation plot including only the highest correlations might be created as follows. I personally found this to be the most intuitive way to find the highest correlations:
cb <- as.data.frame(cor(Boston)) # Create a DF of the correlation matrix.
cor_matrix_high <- as.data.frame(matrix(nrow = 14, ncol = 14)) #Copy
colnames(cor_matrix_high) <- colnames(cor_matrix) #the structure of
rownames(cor_matrix_high) <- rownames(cor_matrix) #cor_matrix.
cor_threshold <- 0.7
# Loop through the correlation matrix and save only values that exceed the threshold.
for(col in names(cb)) {
for(row in 1:length(cb[[col]])) {
if(abs(cb[[col,row]]) > cor_threshold & abs(cb[[col,row]]) < 1) {
cor_matrix_high[col,as.character(rownames(cb)[row])] <- round(cb[[col,row]], digits = 2)
}
}
}
# Print the matrix.
cor_matrix_high
## crim zn indus chas nox rm age dis rad tax ptratio
## crim NA NA NA NA NA NA NA NA NA NA NA
## zn NA NA NA NA NA NA NA NA NA NA NA
## indus NA NA NA NA 0.76 NA NA -0.71 NA 0.72 NA
## chas NA NA NA NA NA NA NA NA NA NA NA
## nox NA NA 0.76 NA NA NA 0.73 -0.77 NA NA NA
## rm NA NA NA NA NA NA NA NA NA NA NA
## age NA NA NA NA 0.73 NA NA -0.75 NA NA NA
## dis NA NA -0.71 NA -0.77 NA -0.75 NA NA NA NA
## rad NA NA NA NA NA NA NA NA NA 0.91 NA
## tax NA NA 0.72 NA NA NA NA NA 0.91 NA NA
## ptratio NA NA NA NA NA NA NA NA NA NA NA
## black NA NA NA NA NA NA NA NA NA NA NA
## lstat NA NA NA NA NA NA NA NA NA NA NA
## medv NA NA NA NA NA NA NA NA NA NA NA
## black lstat medv
## crim NA NA NA
## zn NA NA NA
## indus NA NA NA
## chas NA NA NA
## nox NA NA NA
## rm NA NA NA
## age NA NA NA
## dis NA NA NA
## rad NA NA NA
## tax NA NA NA
## ptratio NA NA NA
## black NA NA NA
## lstat NA NA -0.74
## medv NA -0.74 NA
Variables tax (property taxes) and rad have a remarkably high correlation with each other: 0.91. This might mean that the taxation is at least partially based on highway accessibility. Other relatively high correlations include e.g. the negative correlation between the amount of nitrogen oxides (nox) and the distance to employment centres (dis) (-0.77), the positive correlation between industry presence (indus) and nox (0.76), and the negative correlation between the proportion of pre-1940s buildings (age) and dis (-0.75).
To further analyse the dataset, it must be standardised, i.e. each variable centred and scaled so that its mean is zero and its standard deviation is one. (Note that this does not make the variables normally distributed; it only recentres and rescales them.) This can be done as follows:
boston_scaled <- as.data.frame(scale(Boston))
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681
## Median :-0.2723 Median :-0.1441 Median :-0.1084
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515
## age dis rad
## Min. :-2.3331 Min. :-1.2658 Min. :-0.9819
## 1st Qu.:-0.8366 1st Qu.:-0.8049 1st Qu.:-0.6373
## Median : 0.3171 Median :-0.2790 Median :-0.5225
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.9059 3rd Qu.: 0.6617 3rd Qu.: 1.6596
## Max. : 1.1164 Max. : 3.9566 Max. : 1.6596
## tax ptratio black
## Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
We can now see from the output of summary() that this works as intended. We also need to categorise our target variable – crim – to classify it:
# Create a quantile vector of crim, and use it to create the categorical "crime".
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label = c('low','med_low','med_high','high'))
# Replace the original unscaled variable.
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
table(boston_scaled$crime) # Explore the categorised variable.
##
## low med_low med_high high
## 127 126 126 127
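The quantile-based binning done by cut() above can be illustrated on a toy vector (not the Boston data):

```r
# quantile() returns the minimum, the three quartiles and the maximum;
# cut() then splits the vector at those five break points into four classes.
v <- c(1, 2, 3, 4, 5, 6, 7, 8)
bins <- quantile(v) # 0%, 25%, 50%, 75%, 100%
cls <- cut(v, breaks = bins, include.lowest = TRUE,
           labels = c('low', 'med_low', 'med_high', 'high'))
table(cls) # two observations fall into each class
```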
To create the LDA model and to test it, the data has to be divided into training and testing sets. This can be done as follows, choosing randomly 80% of the data to be used for training:
n <- nrow(boston_scaled) # Get number of rows in the dataset.
ind <- sample(n, size = n * 0.8) # Choose randomly 80% of the rows.
train <- boston_scaled[ind,] # Create train set.
test <- boston_scaled[-ind,] # Create test set.
# Save the correct classes from the test data.
correct_classes <- test$crime
# Remove the crime variable from the test data.
test <- dplyr::select(test, -crime)
With the data divided, it is now possible to fit the LDA model on the training set:
lda.fit <- lda(crime ~ ., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2425743 0.2549505 0.2475248 0.2549505
##
## Group means:
## zn indus chas nox rm
## low 1.0184911 -0.9045978 -0.11163110 -0.8803544 0.4609418
## med_low -0.1115449 -0.2825412 -0.04298342 -0.5663564 -0.1342048
## med_high -0.3654691 0.1162216 0.12138096 0.3080398 0.1880336
## high -0.4872402 1.0170891 -0.08120770 1.0403010 -0.4224060
## age dis rad tax ptratio
## low -0.8797037 0.8998531 -0.6888949 -0.7590069 -0.4583341
## med_low -0.3692625 0.3553846 -0.5581649 -0.4972214 -0.1090633
## med_high 0.3337648 -0.3404503 -0.3995984 -0.3146319 -0.2857040
## high 0.8237017 -0.8544174 1.6384176 1.5142626 0.7811136
## black lstat medv
## low 0.3743885 -0.7735376 0.55716848
## med_low 0.3153310 -0.1373100 0.01247855
## med_high 0.1364794 -0.0900935 0.27064972
## high -0.8639407 0.9305196 -0.70577202
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.11479554 0.904438618 -0.86504028
## indus -0.01660852 -0.185060511 0.27291681
## chas -0.09728633 -0.040208581 0.08078549
## nox 0.48055410 -0.687486235 -1.41845418
## rm -0.10429272 -0.101895589 -0.11529294
## age 0.19818757 -0.174917384 -0.20719617
## dis -0.07036028 -0.307649720 0.13247805
## rad 3.14035711 1.148596993 -0.01097091
## tax 0.04428185 -0.360711423 0.58708293
## ptratio 0.12974038 0.045313224 -0.34570427
## black -0.13571599 -0.008593115 0.11399544
## lstat 0.24089482 -0.300760007 0.36337417
## medv 0.19441278 -0.431145684 -0.29709474
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9523 0.0346 0.0131
As we used quantiles to categorise the original variable, we have four classes. Thus, the output shows that we have three linear discriminants, as expected. Of these, the first explains the vast majority – about 95% – of the between-group variance.
The first two of the model's linear discriminants can be visualised as follows. A helper function is needed to draw the arrows in the biplot:
# Define a function for the biplot arrows.
lda.arrows <- function(x, myscale = 2, arrow_heads = 0.2, color = "deeppink", tex = 1, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
classes <- as.numeric(train$crime) # Turn the classes to numeric for plotting.
plot(lda.fit, dimen = 2, col = classes, pch = classes) # Plot.
lda.arrows(lda.fit) # Add arrows.
It’s possible to visualise all three discriminants, but the lda.arrows() function is incompatible with that:
plot(lda.fit, dimen = 3, col = classes, pch = classes) # Plot.
We’ve already prepared the test set above, so it’s now possible to move straight into predictions:
lda.pred <- predict(lda.fit, newdata = test) # Predict the test values.
# Cross tabulate the predictions with the correct values.
addmargins(table(correct = correct_classes, predicted = lda.pred$class))
## predicted
## correct low med_low med_high high Sum
## low 17 11 1 0 29
## med_low 5 13 5 0 23
## med_high 0 4 21 1 26
## high 0 0 0 24 24
## Sum 22 28 27 25 102
(I used addmargins() when tabulating because, in my opinion, that's more illustrative and helps comparisons.) As seen from the table, the model did predict the highest crime rates reliably, but the "med_low" category is overrepresented relative to the "low" and "med_high" categories. Thus, the model can be used to make crude predictions, but it's hardly perfect. It might be better to use an unsupervised method and cluster the data instead of classifying it.
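The overall accuracy implied by the cross-tabulation can be computed from its diagonal. A sketch using the counts printed above (note that the exact numbers vary between runs, because the train/test split is random):

```r
# The cross-tabulation above as a matrix (rows = correct, cols = predicted).
tbl <- matrix(c(17, 11,  1,  0,
                 5, 13,  5,  0,
                 0,  4, 21,  1,
                 0,  0,  0, 24), nrow = 4, byrow = TRUE)
sum(diag(tbl)) / sum(tbl) # about 0.74, i.e. roughly three in four correct
```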
To cluster the data, it needs to be loaded and standardised again, and a distance matrix created out of it:
boston_scaled <- as.data.frame(scale(Boston)) # Standardise the data.
dist_eu <- dist(boston_scaled) # Create a Euclidean distance matrix.
summary(dist_eu) # Summarise the matrix.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4620 4.8240 4.9110 6.1860 14.4000
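The Euclidean distances computed by dist() are the familiar root-sum-of-squares between observation rows; a quick sanity check on two toy points:

```r
# Two points in 2D; dist() should agree with the Pythagorean computation.
m <- rbind(a = c(0, 0), b = c(3, 4))
as.numeric(dist(m))            # 5
sqrt(sum((m[1, ] - m[2, ])^2)) # also 5
```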
We can try clustering the data with k-means straight away. We used four classes for our LDA model, so we might try the same number of clusters:
km <- kmeans(dist_eu, centers = 4) # Cluster the data.
pairs(boston_scaled, col = km$cluster) # Plot the clusters.
However, while the results look somewhat reasonable, the number of clusters was merely a guess. To determine it properly, the total within-cluster sum of squares (TWCSS) should be calculated for different numbers of clusters. Let's try it, with a maximum of 15 clusters:
k_max <- 15 # Maximum number of clusters to try.
# Define a function for testing.
k_try <- function(k) {
kmeans(dist_eu, k)$tot.withinss
}
# Calculate the total within sum of squares using the function.
twcss <- sapply(1:k_max, k_try)
# Visualize the results.
plot(1:k_max, twcss, type='b')
The optimal number of clusters is where the TWCSS drops radically; however, by inspecting the above plot, it's somewhat debatable whether this happens with just two or with four clusters. Thus, for comparison, let's re-cluster the data with just two clusters:
km <- kmeans(dist_eu, centers = 2) # Cluster the data.
pairs(boston_scaled, col = km$cluster) # Plot the clusters.
As the plots above demonstrate, there seems to be less overlap between the clusters than with four clusters, which suggests that, at least when using the Euclidean distance, the optimal number of clusters is indeed two. Because it's possible that different distance measures produce different results, I also briefly tested my code by creating the dist_eu variable using the Manhattan distance instead, but found the results in this case so similar to the Euclidean method that it's not worth repeating them here. The most notable difference was that with the Manhattan method, the TWCSS plot hinted even more strongly that the optimal number of clusters is two.
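For reference, the difference between the two distance measures mentioned above can be seen on a single pair of points:

```r
# Euclidean vs Manhattan distance between (0, 0) and (2, 3).
m <- rbind(c(0, 0), c(2, 3))
as.numeric(dist(m, method = "euclidean")) # sqrt(2^2 + 3^2) = sqrt(13), ~3.61
as.numeric(dist(m, method = "manhattan")) # |2| + |3| = 5
```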